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Abstract 

Compressive sensing (CS) is an emerging field based on the revelation that a small 
collection of linear projections of a sparse signal contains enough information for sta- 
ble, sub-Nyquist signal acquisition. When a statistical characterization of the signal is 
available, Bayesian inference can complement conventional CS methods based on lin- 
ear programming or greedy algorithms. We perform approximate Bayesian inference 
using belief propagation (BP) decoding, which represents the CS encoding matrix as 
a graphical model. Fast computation is obtained by reducing the size of the graph- 
ical model with sparse encoding matrices. To decode a length-iV signal containing 
K large coefficients, our CS-BP decoding algorithm uses 0(K log(iV)) measurements 
and 0(N log 2 (A r )) computation. Finally, although we focus on a two-state mixture 
Gaussian model, CS-BP is easily adapted to other signal models. 

1 Introduction 

Many signal processing applications require the identification and estimation of a few signif- 
icant coefficients from a high- dimensional vector. The wisdom behind this is the ubiquitous 
compressibility of signals: in an appropriate basis, most of the information contained in a 
signal often resides in just a few large coefficients. Traditional sensing and processing first 
acquires the entire data, only to later throw away most coefficients and retain the few sig- 
nificant ones [2] . Interestingly, the information contained in the few large coefficients can be 
captured (encoded) by a small number of random linear projections [3]. The ground-breaking 
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work in compressive sensing (CS) [4-6] has proved for a variety of settings that the signal 
can then be decoded in a computationally feasible manner from these random projections. 

1.1 Compressive sensing 

Sparsity and random encoding: In a typical compressive sensing (CS) setup, a signal 
vector x G has the form x = ^9, where \1/ G ]R 7VxAr is an orthonormal basis, and G 
satisfies ||#||o = K <C iV. 1 Owing to the sparsity of £ relative to the basis there is no 
need to sample all N values of x. Instead, the CS theory establishes that x can be decoded 
from a small number of projections onto an incoherent set of measurement vectors [4,5]. 
To measure (encode) x, we compute M <^ N linear projections of x via the matrix-vector 
multiplication y = $x where <£> G R MxJV is the encoding matrix. 

In addition to strictly sparse signals where ||6>|| < K, other signal models are possible. 
Approximately sparse signals have K N large coefficients, while the remaining coefficients 
are small but not necessarily zero. Compressible signals have coefficients that, when sorted, 
decay quickly according to a power law. Similarly, both noiseless and noisy signals and 
measurements may be considered. We emphasize noiseless measurement of approximately 
sparse signals in the paper. 

Decoding via sparsity: Our goal is to decode x given y and $. Although decoding x 
from y = appears to be an ill-posed inverse problem, the prior knowledge of sparsity in x 
enables to decode x from M <ti N measurements. Decoding often relies on an optimization, 
which searches for the sparsest coefficients 9 that agree with the measurements y. If M is 
sufficiently large and 9 is strictly sparse, then 9 is the solution to the £ minimization: 

9 = argmin ||0|| o s.t. y = 

Unfortunately, solving this £ optimization is NP-complete [7] . 

The revelation that supports the CS theory is that a computationally tractable optimiza- 
tion problem yields an equivalent solution. We need only solve for the ^-sparsest coefficients 
that agree with the measurements y [4,5]: 

9 = argmin s.t. y = <&\I>0, (1) 

as long as satisfies some technical conditions, which are satisfied with overwhelming 
probability when the entries of <3> are independent and identically distributed (iid) sub- 
Gaussian random variables [4]. This l\ optimization problem (1), also known as Basis 
Pursuit [8], can be solved with linear programming methods. The i\ decoder requires only 
M = 0(K\og(N/K)) projections [9, 10]. However, encoding by a dense Gaussian $ is slow, 
and £i decoding requires cubic computation in general [11]. 

1.2 Fast CS decoding 

While l\ decoders figure prominently in the CS literature, their cubic complexity still renders 
them impractical for many applications. For example, current digital cameras acquire images 

1 We use || • || o to denote the 4) "norm" that counts the number of non-zero elements. 
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with iV = 10 6 pixels or more, and fast decoding is critical. The slowness of l\ decoding has 
motivated a flurry of research into faster algorithms. 

One line of research involves iterative greedy algorithms. The Matching Pursuit (MP) [12] 
algorithm, for example, iteratively selects the vectors from the matrix that contain most 
of the energy of the measurement vector y. MP has been proven to successfully decode the 
acquired signal with high probability [12, 13]. Algorithms inspired by MP include OMP [12], 
tree matching pursuit [14], stagewise OMP [15], CoSaMP [16], IHT [17], and Subspace 
Pursuit [18] have been shown to attain similar guarantees to those of their optimization- 
based counterparts [19-21]. 

While the CS algorithms discussed above typically use a dense $ matrix, a class of meth- 
ods has emerged that employ structured <3>. For example, subsampling an orthogonal basis 
that admits a fast implicit algorithm also leads to fast decoding [4] . Encoding matrices that 
are themselves sparse can also be used. Cormode and Muthukrishnan proposed fast stream- 
ing algorithms based on group testing [22,23], which considers subsets of signal coefficients 
in which we expect at most one "heavy hitter" coefficient to lie. Gilbert et al. [24] propose 
the Chaining Pursuit algorithm, which works best for extremely sparse signals. 

1.3 Bayesian CS 

CS decoding algorithms rely on the sparsity of the signal x. In some applications, a statisti- 
cal characterization of the signal is available, and Bayesian inference offers the potential for 
more precise estimation of a; or a reduction in the number of CS measurements. Ji et al. [25] 
have proposed a Bayesian CS framework where relevance vector machines are used for signal 
estimation. For certain types of hierarchical priors, their method can approximate the pos- 
terior density of x and is somewhat faster than l\ decoding. Seeger and Nickisch [26] extend 
these ideas to experimental design, where the encoding matrix is designed sequentially based 
on previous measurements. Another Bayesian approach by Schniter et al. [27] approximates 
conditional expectation by extending the maximal likelihood approach to a weighted mixture 
of the most likely models. There are also many related results on application of Bayesian 
methods to sparse inverse problems (c.f. [28] and references therein). 

Bayesian approaches have also been used for multiuser decoding (MUD) in communi- 
cations. In MUD, users modulate their symbols with different spreading sequences, and 
the received signals are superpositions of sequences. Because most users are inactive, MUD 
algorithms extract information from a sparse superposition in a manner analogous to CS 
decoding. Guo and Wang [29] perform MUD using sparse spreading sequences and decode 
via belief propagation (BP) [30-35]; our paper also uses sparse encoding matrices and BP 
decoding. A related algorithm for decoding low density lattice codes (LDLC) by Sommer 
et al. [36] uses BP on a factor graph whose self and edge potentials are Gaussian mix- 
tures. Convergence results for the LDLC decoding algorithm have been derived for Gaussian 
noise [36]. 
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1.4 Contributions 



In this paper, we develop a sparse encoder matrix $ and a belief propagation (BP) decoder to 
accelerate CS encoding and decoding under the Bayesian framework. We call our algorithm 
CS-BP. Although we emphasize a two-state mixture Gaussian model as a prior for sparse 
signals, CS-BP is flexible to variations in the signal and measurement models. 

Encoding by sparse CS matrix: The dense sub-Gaussian CS encoding matrices [4, 5] 
are reminiscent of Shannon's random code constructions. However, although dense matrices 
capture the information content of sparse signals, they may not be amenable to fast encoding 
and decoding. Low density parity check (LDPC) codes [37,38] offer an important insight: 
encoding and decoding are fast, because multiplication by a sparse matrix is fast; nonetheless, 
LDPC codes achieve rates close to the Shannon limit. Indeed, in a previous paper [39], we 
used an LDPC-like sparse $ for the special case of noiseless measurement of strictly sparse 
signals; similar matrices were also proposed for CS by Berinde and Indyk [40]. Although 
LDPC decoding algorithms may not have provable convergence, the recent extension of 
LDPC to LDLC codes [36] offers provable convergence, which may lead to similar future 
results for CS decoding. 

We encode (measure) the signal using sparse Rademacher ({0, 1, —1}) LDPC-like $ ma- 
trices. Because entries of $ are restricted to {0,1,-1}, encoding only requires sums and 
differences of small subsets of coefficient values of x. The design of <£>, including charac- 
teristics such as column and row weights, is based on the relevant signal and measurement 
models, as well as the accompanying decoding algorithm. 

Decoding by BP: We represent the sparse $ as a sparse bipartite graph. In addition to 
accelerating the algorithm, the sparse structure reduces the number of loops in the graph and 
thus assists the convergence of a message passing method that solves a Bayesian inference 
problem. Our estimate for x explains the measurements while offering the best match to 
the prior. We employ BP in a manner similar to LDPC channel decoding [34,37,38]. To 
decode a length- N signal containing K large coefficients, our CS-BP decoding algorithm uses 
M = 0(K\og(N)) measurements and 0(N\og 2 (N)) computation. Although CS-BP is not 
guaranteed to converge, numerical results are quite favorable. 

The remainder of the paper is organized as follows. Section 2 defines our signal model, 
and Section 3 describes our sparse CS-LDPC encoding matrix. The CS-BP decoding algo- 
rithm is described in Section 4, and its performance is demonstrated numerically in Section 5. 
Variations and applications are discussed in Section 6, and Section 7 concludes. 

2 Mixture Gaussian signal model 

We focus on a two-state mixture Gaussian model [41-43] as a prior that succinctly captures 
our prior knowledge about approximate sparsity of the signal. Bayesian inference using a 
two-state mixture model has been studied well before the advent of CS, for example by 
George and McCulloch [44] and Geweke [45]; the model was proposed for CS in [1] and 
also used by He and Carin [46]. More formally, let X = [X(l), . . . , X(N)j be a random 
vector in M. N , and consider the signal x = [x(l), . . . , x(N)] as an outcome of X. Because our 
approximately sparse signal consists of a small number of large coefficients and a large number 
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f(X\Q = 0) f(X\Q = l) f(X) 




Figure 1: Mixture Gaussian model for signal coefficients. The distribution of X conditioned on 
the two state variables, Q = and Q = 1, is depicted. Also shown is the overall distribution for X. 

of small coefficients, we associate each probability density function (pdf) f(X(i)) with a state 
variable Q(i) that can take on two values. Large and small magnitudes correspond to zero 
mean Gaussian distributions with high and low variances, which are implied by Q(i) = 1 
and Q(i) = 0, respectively, 

/(X(i)|Q(i) = l)~jV(0,(7?) and /(X(z)|gW = 0)~AT(0,a 2 ), 

with o\ > o"q. Let Q = [Q(l), . . . ,Q(N)] be the state random vector associated with 
the signal; the actual configuration q = [q(l), . . . ,q(N)] G {0,1}^ is one of 2 N possible 
outcomes. We assume that the Q(i)'s are iid. 2 To ensure that we have approximately K 
large coefficients, we choose the probability mass function (pmf) of the state variable Q(i) 
to be Bernoulli with Pr (Q(i) = 1) = S and Pr (Q(i) = 0) = f - S, where S = K/N is the 
sparsity rate. 

The resulting model for signal coefficients is a two-state mixture Gaussian distribution, as 
illustrated in Figure 1. This mixture model is completely characterized by three parameters: 
the sparsity rate S and the variances <r 2 and <j\ of the Gaussian pdf 's corresponding to each 
state. 

Mixture Gaussian models have been successfully employed in image processing and infer- 
ence problems, because they are simple yet effective in modeling real- world signals [41-43]. 
Theoretical connections have also been made between wavelet coefficient mixture models and 
the fundamental parameters of Besov spaces, which have proved invaluable for characterizing 
real-world images. Moreover, arbitrary densities with a finite number of discontinuities can 
be approximated arbitrarily closely by increasing the number of states and allowing non- 
zero means [47]. We leave these extensions for future work, and focus on two-state mixture 
Gaussian distributions for modeling the signal coefficients. 

3 Sparse encoding 

Sparse CS encoding matrix: We use a sparse $ matrix to accelerate both CS encoding 
and decoding. Our CS encoding matrices are dominated by zero entries, with a small number 
of non- zeros in each row and each column. We focus on CS-LDPC matrices whose non-zero 

2 The model can be extended to capture dependencies between coefficients, as suggested by Ji et al. [25]. 
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Figure 2: Factor graph depicting the relationship between variable nodes (black) and constraint 
nodes (white) in CS-BP. 



entries are { — 1, l}; 3 each measurement involves only sums and differences of a small subset 
of coefficients of x. Although the coherence between a sparse $ and which is the maximal 
inner product between rows of $ and may be higher than the coherence using a dense $ 
matrix [48] , as long as <3> is not too sparse (see Theorem 1 below) the measurements capture 
enough information about x to decode the signal. A CS-LDPC $ can be represented as a 
bipartite graph G, which is also sparse. Each edge of G connects a coefficient node x(i) to 
an encoding node y(j) and corresponds to a non-zero entry of $ (Figure 2). 

In addition to the core structure of we can introduce other constraints to tailor the 
measurement process to the signal model. The constant row weight constraint makes sure 
that each row of $ contains exactly L non-zero entries. The row weight L can be chosen 
based on signal properties such as sparsity, possible measurement noise, and details of the 
decoding process. Another option is to use a constant column weight constraint, which fixes 
the number of non-zero entries in each column of $ to be a constant R. 

Although our emphasis is on noiseless measurement of approximately sparse signals, we 
briefly discuss noisy measurement of a strictly sparse signal, and show that a constant row 
weight L ensures that the measurements are approximated by two-state mixture Gaussians. 
To see this, consider a strictly sparse x with sparsity rate S and Gaussian variance a\. We 
now have y = §x + z, where z ~ A/"(0, af ) is additive white Gaussian noise (AWGN) with 
variance o~ z . In our approximately sparse setting, each row of $ picks up ~ L(l — S) small 
magnitude coefficients. If L(l — S)o~q ~ a z , then the few large coefficients will be obscured 
by similar noise artifacts. 

Our definition of $ relies on the implicit assumption that x is sparse in the canonical 
sparsifying basis, i.e., ^ = I . In contrast, if x is sparse in some other basis then more 
complicated encoding matrices may be necessary. We defer the discussion of these issues to 
Section 6, but emphasize that in many practical situations our methods can be extended to 

3 CS-LDPC matrices are slightly different from LDPC parity check matrices, which only contain the binary 
entries and 1. We have observed numerically that allowing negative entries offers improved performance. 
At the expense of additional computation, further minor improvement can be attained using sparse matrices 
with Gaussian non-zero entries. 
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support the sparsifying basis \& in a computationally tractable manner. 

Information content of sparsely encoded measurements: The sparsity of our CS- 
LDPC matrix may yield measurements y that contain less information about the signal x 
than a dense Gaussian <£>. The following theorem, whose proof appears in the Appendix, ver- 
ifies that y retains enough information to decode x well. As long as S = K/N = f2 ^(fif) ^ > 
then M = 0(K\og(N)) measurements are sufficient. 

Theorem 1 Let x be a two-state mixture Gaussian signal with sparsity rate S = K/N 
and variances <Tq and a\, and let $ be a CS-LDPC matrix with constant row weight L = 
r] ln( - SN s — —, where 77,7 > 0. If 



M = 



(1 + 2t 7 - 1 )(1 + 7 ) 
V 2 



2K + (N - K) ( — 



2 



log(iV) , (2) 



then x can be decoded to x such that \\x — x\\oo < /iai with probability 1 — 2N 7 . 

The proof of Theorem 1 relies on a result by Wang et al. [49, Theorem 1]. Their proof 
partitions <£> into M2 sub-matrices of Mi rows each, and estimates each xi as a median of inner 
products with sub-matrices. The performance guarantee relies on the union bound; a less 
stringent guarantee yields a reduction in M 2 . Moreover, L can be reduced if we increase the 
number of measurements accordingly. Based on numerical results, we propose the following 
modified values as rules of thumb, 

S- 1 = N/K, M = 0(K\og(N)), and R = LM/N = 0(\og(N)). (3) 

Noting that each measurement requires O(L) additions and subtractions, and using our rules 
of thumb for L and M (3), the computation required for encoding is O(LM) = 0(Nlog(N)), 
which is significantly lower than the O(MN) = 0(KN\og(N)) required for dense Gaussian 



4 CS-BP decoding of approximately sparse signals 

Decoding approximately sparse random signals can be treated as a Bayesian inference prob- 
lem. We observe the measurements y = $x, where x is a mixture Gaussian signal. Our goal is 
to estimate x given $ and y. Because the set of equations y = &x is under-determined, there 
are infinitely many solutions. All solutions lie along a hyperplane of dimension iV — M. We 
locate the solution within this hyperplane that best matches our prior signal model. Consider 
the minimum mean square error (MMSE) and maximum a posteriori (MAP) estimates, 

^mmse = arg min£ , ||X - x'\\\ s.t. y — ®x', 

x' 

^map = ar g max/(X = a; / ) s.t. y = Qx', 

x' 

where the expectation is taken over the prior distribution for X. The MMSE estimate can 
be expressed as the conditional mean, x M mse = E [X\Y — y], where Y e M M is the random 
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vector that corresponds to the measurements. Although the precise computation of Smmse 
may require the evaluation of 2 N terms, a close approximation to the MMSE estimate can be 
obtained using the (usually small) set of state configuration vectors q with dominant posterior 
probability [27]. Indeed, exact inference in graphical models is NP-hard [50], because of loops 
in the graph induced by However, the sparse structure of <3> reduces the number of loops 
and enables us to use low-complexity message-passing methods to estimate x approximately. 

4.1 Decoding algorithm 

We now employ belief propagation (BP), an efficient method for solving inference problems by 
iteratively passing messages over graphical models [30-35]. Although BP has not been proved 
to converge, for graphs with few loops it often offers a good approximation to the solution 
to the MAP inference problem. BP relies on factor graphs, which enable fast computation of 
global multivariate functions by exploiting the way in which the global function factors into 
a product of simpler local functions, each of which depends on a subset of variables [51]. 

Factor graph for CS-BP: The factor graph shown in Figure 2 captures the relationship 
between the states q, the signal coefficients x, and the observed CS measurements y. The 
graph is bipartite and contains two types of vertices; all edges connect variable nodes (black) 
and constraint nodes (white). There are three types of variable nodes corresponding to 
state variables Q(i), coefficient variables X(i), and measurement variables Y(j). The factor 
graph also has three types of constraint nodes, which encapsulate the dependencies that 
their neighbors in the graph (variable nodes) are subjected to. First, prior constraint nodes 
impose the Bernoulli prior on state variables. Second, mixing constraint nodes impose the 
conditional distribution on coefficient variables given the state variables. Third, encoding 
constraint nodes impose the encoding matrix structure on measurement variables. 

Message passing: CS-BP approximates the marginal distributions of all coefficient 
and state variables in the factor graph, conditioned on the observed measurements Y, by 
passing messages between variable nodes and constraint nodes. Each message encodes the 
marginal distributions of a variable associated with one of the edges. Given the distributions 
Pv(Q(i)\Y = y) and f(X(i)\Y = y), one can extract MAP and MMSE estimates for each 
coefficient. 

Denote the message sent from a variable node v to one of its neighbors in the bipartite 

graph, a constraint node c, by fi v >c (i>); a message from c to v is denoted by /i c > v (v). The 

message ji v >c (i>) is updated by taking the product of all messages received by v on all other 

edges. The message /i c > v {v) is computed in a similar manner, but the constraint associated 

with c is applied to the product and the result is marginalized. More formally, 

Vv-^c(v) = JJ /J u -^ v (v), (4) 

u£n(v)\{c} 

Hc—+v(v) = ^2 con(ra(c)) JJ n w _ c (w) , (5) 

y wen(c)\{v} J 

where n{y) and n(c) are sets of neighbors of v and c, respectively, con(n(c)) is the constraint 
on the set of variable nodes n(c), and ~ {v} is the set of neighbors of c excluding v. We 
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interpret these 2 types of message processing as multiplication of beliefs at variable nodes (4) 
and convolution at constraint nodes (5). Finally, the marginal distribution f(v) for a given 
variable node is obtained from the product of all the most recent incoming messages along 
the edges connecting to that node, 

f( v ) = n (6) 

Based on the marginal distribution, various statistical characterizations can be computed, 
including MMSE, MAP, error bars, and so on. 

We also need a method to encode beliefs. One method is to sample the relevant pdf 's uni- 
formly and then use the samples as messages. Another encoding method is to approximate 
the pdf by a mixture Gaussian with a given number of components, where mixture param- 
eters are used as messages. These two methods offer different trade-offs between modeling 
flexibility and computational requirements; details appear in Sections 4.2 and 4.3. We leave 
alternative methods such as particle filters and importance sampling for future research. 

Protecting against loopy graphs and message quantization errors: BP converges 
to the exact conditional distribution in the ideal situation where the following conditions are 
met: (i) the factor graph is cycle-free; and (ii) messages are processed and propagated 
without errors. In CS-BP decoding, both conditions are violated. First, the factor graph 
is loopy — it contains cycles. Second, message encoding methods introduce errors. These 
non-idealities may lead CS-BP to converge to imprecise conditional distributions, or more 
critically, lead CS-BP to diverge [52-54]. To some extent these problems can be reduced 
by (i) using CS-LDPC matrices, which have a relatively modest number of loops; and (ii) 
carefully designing our message encoding methods (Sections 4.2 and 4.3). We stabilize CS-BP 
against these non-idealities using message damped belief propagation (MDBP) [55], where 
messages are weighted averages between old and new estimates. Despite the damping, CS- 
BP is not guaranteed to converge, and yet the numerical results of Section 5 demonstrate 
that its performance is quite promising. We conclude with a prototype algorithm; Matlab 
code is available at http://dsp.rice.edu/CSBP. 

CS-BP Decoding Algorithm 

1. Initialization: Initialize the iteration counter i — 1. Set up data structures for factor 

graph messages /i v iC (v) and /i c > v (v). Initialize messages /i v >c (v) from variable to 

constraint nodes with the signal prior. 

2. Convolution: For each measurement c = 1, . . . , M, which corresponds to constraint 

node c, compute /j, c > v (v) via convolution (5) for all neighboring variable nodes n(c). 

If measurement noise is present, then convolve further with a noise prior. Apply 
damping methods such as MDBP [55] by weighting the new estimates from iteration % 
with estimates from previous iterations. 

3. Multiplication: For each coefficient v = 1, . . . , N, which corresponds to a variable 

node v, compute fi v >c (v) via multiplication (4) for all neighboring constraint nodes 

n(v). Apply damping methods as needed. If the iteration counter has yet to reach its 
maximal value, then go to Step 2. 
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4. Output: For each coefficient v = 1,...,N, compute MMSE or MAP estimates (or 
alternative statistical characterizations) based on the marginal distribution f(v) (6). 
Output the requisite statistics. 

4.2 Samples of the pdf as messages 

Having described main aspects of the CS-BP decoding algorithm, we now focus on the 
two message encoding methods, starting with samples. In this method, we sample the pdf 
and send the samples as messages. Multiplication of pdf's (4) corresponds to point-wise 
multiplication of messages; convolution (5) is computed efficiently in the frequency domain. 4 

The main advantage of using samples is flexibility to different prior distributions for 
the coefficients; for example, mixture Gaussian priors are easily supported. Additionally, 
both multiplication and convolution are computed efficiently. However, sampling has large 
memory requirements and introduces quantization errors that reduce precision and hamper 
the convergence of CS-BP [52]. Sampling also requires finer sampling for precise decoding; 
we propose to sample the pdf's with a spacing less than er - 

We analyze the computational requirements of this method. Let each message be a vector 
of p samples. Each iteration performs multiplication at coefficient nodes (4) and convolution 
at constraint nodes (5). Outgoing messages are modified, 



where the denominators are non-zero, because mixture Gaussian pdf's are strictly positive. 
The modifications (7) reduce computation, because the numerators are computed once and 
then reused for all messages leaving the node being processed. 

Assuming that the column weight R is fixed (Section 3), the computation required for 
message processing at a variable node is O(Rp) per iteration, because we multiply R + 1 
vectors of length p. With O(N) variable nodes, each iteration requires O(NRp) computa- 
tion. For constraint nodes, we perform convolution in the frequency domain, and so the 
computational cost per node is 0(Lp\og(p)). With O(M) constraint nodes, each iteration 
is 0(LMp \og(p)). Accounting for both variable and constraint nodes, each iteration is 
0(NRp + LMp\og(p)) = 0(p\og(p)N 4og(iV)) , where we employ our rules of thumb for L, 
M, and R (3). To complete the computational analysis, we note first that we use 0(log(iV)) 
CS-BP iterations, which is proportional to the diameter of the graph [56]. Second, sampling 
the pdf's with a spacing less than cq, we choose p = 0(<7i/(Xo) to support a maximal ampli- 
tude on the order of o\. Therefore, our overall computation is O ( — log ( — ] Nlog 2 (N) ) , 



4.3 Mixture Gaussian parameters as messages 

In this method, we approximate the pdf by a mixture Gaussian with a maximum number 
of components, and then send the mixture parameters as messages. For both multiplication 

4 Fast convolution via FFT has been used in LDPC decoding over GF(2 q ) using BP [34]. 




and 






which scales as 0(N \og 2 (N)) when a and o\ are constant. 
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Table 1: Computational and storage requirements of CS-BP decoding 



Messages 


Parameter 


Computation 


Storage 


Samples of pdf 
Mixture Gaussians 


p = 0(<7i/<7o) samples 
m components 


0(m 2 f log 2 (A0) 


0(pN\og(N)) 
0(mN\og(N)) 



(4) and convolution (5), the resulting number of components in the mixture is multiplicative 
in the number of constituent components. To keep the message representation tractable, we 
perform model reduction using the Iterative Pairwise Replacement Algorithm (IPRA) [57], 
where a sequence of mixture models is computed iteratively. 

The advantage of using mixture Gaussians to encode pdf 's is that the messages are short 
and hence consume little memory. This method works well for mixture Gaussian priors, but 
could be difficult to adapt to other priors. Model order reduction algorithms such as IPRA 
can be computationally expensive [57], and introduce errors in the messages, which impair 
the quality of the solution as well as the convergence of CS-BP [52]. 

Again, we analyze the computational requirements. Because it is impossible to undo the 
multiplication in (4) and (5), we cannot use the modified form (7). Let m be the maximum 
model order. Model order reduction using IPRA [57] requires 0{m 2 R 2 ) computation per co- 
efficient node per iteration. With O(N) coefficient nodes, each iteration is 0(m 2 R 2 N). Sim- 
ilarly, with O(M) constraint nodes, each iteration is 0(m 2 L 2 M). Accounting for 0(log(N)) 
CS-BP iterations, overall computation is 0(m 2 [L 2 M + R 2 N] \og(N)) = O (m 2 f log 2 (N)) . 

4.4 Properties of CS-BP decoding 

We briefly describe several properties of CS-BP decoding. The computational characteristics 
of the two methods for encoding beliefs about conditional distributions were evaluated in 
Sections 4.2 and 4.3. The storage requirements are mainly for message representation of the 
LM = 0(Nlog(N)) edges. For encoding with pdf samples, the message length is p, and so 
the storage requirement is 0(pN log(N)). For encoding with mixture Gaussian parameters, 
the message length is m, and so the storage requirement is 0(mN\og(N)). Computational 
and storage requirements are summarized in Table 1. 

Several additional properties are now featured. First, we have progressive decoding; more 
measurements will improve the precision of the estimated posterior probabilities. Second, if 
we are only interested in an estimate of the state configuration vector q but not in the coeffi- 
cient values, then less information must be extracted from the measurements. Consequently, 
the number of measurements can be reduced. Third, we have robustness to noise, because 
noisy measurements can be incorporated into our model by convolving the noiseless version 
of the estimated pdf (5) at each encoding node with the pdf of the noise. 

5 Numerical results 

To demonstrate the efficacy of CS-BP, we simulated several different settings. In our first 
setting, we considered decoding problems where N = 1000, S = 0.1, o"i = 10, cr = 1, and the 
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M 

Figure 3: MMSE as a function of the number of measurements M using different matrix row 
weights L. The dashed lines show the £2 norms of x (top) and the small coefficients (bottom). 
(N = 1000, S = 0.1, o"i = 10, (To = 1, and noiseless measurements.) 

measurements are noiseless. We used samples of the pdf as messages, where each message 
consisted of p = 525 = 3 • 5 2 • 7 samples; this choice of p provided fast FFT computation. 
Figure 3 plots the MMSE decoding error as a function of M for a variety of row weights 
L. The figure emphasizes with dashed lines the average £2 norm of x (top) and of the small 
coefficients (bottom); increasing M reduces the decoding error, until it reaches the energy 
level of the small coefficients. A small row weight, e.g., L = 5, may miss some of the large 
coefficients and is thus bad for decoding; as we increase L, fewer measurements are needed 
to obtain the same precision. However, there is an optimal L opt « 2/5 = 20 beyond which 
any performance gains are marginal. Furthermore, values of L > L opt give rise to divergence 
in CS-BP, even with damping. An example of the output of the CS-BP decoder and how it 
compares to the signal x appears in Figure 4, where we used L = 20 and M = 400. Although 
iV = 1000, we only plotted the first 100 signal values x(i) for ease of visualization. 

To compare the performance of CS-BP with other CS decoding algorithms, we also 
simulated: (i) £\ decoding (1) via linear programming; (ii) GPSR [20], an optimization 
method that minimizes ||#||i + \x\y — (Hi) CoSaMP [16], a fast greedy solver; and 

(iv) IHT [17], an iterative thresholding algorithm. We simulated all five methods where 
N = 1000, S = 0.1, L = 20, o"i = 10, Co = 1, p — 525, and the measurements are noiseless. 
Throughout the experiment we ran the different methods using the same CS-LDPC encoding 
matrix $, the same signal x, and therefore same measurements y. Figure 5 plots the MMSE 
decoding error as a function of M for the five methods. For small to moderate M, CS-BP 
exploits its knowledge about the approximately sparse structure of x, and has a smaller 
decoding error. CS-BP requires 20-30% fewer measurements than the optimization methods 
LP and GPSR to obtain the same MMSE decoding error; the advantage over the greedy 
solvers IHT and CoSaMP is even greater. However, as M increases, the advantage of CS-BP 
over LP and GPSR becomes less pronounced. 

To compare the speed of CS-BP to other methods, we ran the same five methods as 
before. In this experiment, we varied the signal length N from 100 to 10000, where S = 0.1, 
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Figure 4: Original signal x and version decoded by CS-BP. (N 
o\ = 10, (Tq = 1, and noiseless measurements.) 



1000, 5 = 0.1, L = 20, M = 400, 



L = 20, o~i = 10, Co = 1, p — 525, and the measurements are noiseless. We mention in 
passing that some of the algorithms that were evaluated can be accelerated using linear 
algebra routines optimized for sparse matrices; the improvement is quite modest, and the 
run-times presented here do not reflect this optimization. Figure 6 plots the run-times of the 
five methods in seconds as a function of N. It can be seen that LP scales more poorly than 
the other algorithms, and so we did not simulate it for N > 3000. 5 CoSaMP also seems to 
scale relatively poorly, although it is possible that our conjugate gradient implementation 
can be improved using the pseudo- inverse approach instead [16]. The run-times of CS-BP 
seem to scale somewhat better than IHT and GPSR. Although the asymptotic computational 
complexity of CS-BP is good, for signals of length N = 10000 it is still slower than IHT 
and GPSR; whereas IHT and GPSR essentially perform matrix-vector multiplications, CS- 
BP is slowed by FFT computations performed in each iteration for all nodes in the factor 

graph. Additionally, whereas the choice p = 0(01/00) yields O l — \og( — ) N\og 2 (N) 



£1 

(To \^cr 

complexity, FFT computation with p = 525 samples is somewhat slow. That said, our main 
contribution is a computationally feasible Bayesian approach, which allows to reduce the 
number of measurements (Figure 5); a comparison between CS-BP and previous Bayesian 
approaches to CS [25, 26] would be favorable. 

To demonstrate that CS-BP deals well with measurement noise, recall the noisy mea- 
surement setting y = $2 + z of Section 3, where z ~ A^(0, a 2 z ) is AWGN with variance o\. 
Our algorithm deals with noise by convolving the noiseless version of the estimated pdf (5) 
with the noise pdf. We simulated decoding problems where N = 1000, S = 0.1, L = 20, 
01 = 10, O = 1, p = 525, and o\ e {0,2,5, 10}. Figure 7 plots the MMSE decoding error 



'Our LP solver is based on interior point methods. 
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Figure 5: MMSE as a function of the number of measurements M using CS-BP, linear programming 
(LP), GPSR, CoSaMP, and IHT. The dashed lines show the £2 norms of x (top) and the small 
coefficients (bottom). (N = 1000, S = 0.1, L = 20, o\ = 10, do = I, and noiseless measurements.) 

as a function of M and erf. To put things in perspective, the average measurement picks 
up a Gaussian term of variance L(l — S)o-q = 18 from the signal. Although the decoding 
error increases with exf , as long as erf <C 18 the noise has little impact on the decoding error; 
CS-BP offers a graceful degradation to measurement noise. 

Our final experiment considers model mismatch where CS-BP has an imprecise statistical 
characterization of the signal. Instead of a two-state mixture Gaussian signal model as 
before, where large coefficients have variance o~\ and occur with probability S, we defined 
a C-component mixture model. In our definition, o\ is interpreted as a background signal 
level, which appears in all coefficients. Whereas the two-state model adds a "true signal" 
component of variance o\ — cxg to the background signal, the C — 1 large components each 
occur with probability S and the amplitudes of the true signals are 02, 2o"2, . . . , (C — l)o"2, 
where o~i is chosen to preserve the total signal energy. At the same time, we did not change 
the signal priors in CS-BP, and used the same two-state mixture model as before. We 
simulated decoding problems where N = 1000, S = 0.1, L = 20, o~ 1 = 10, cr = 1, p — 525, 
the measurements are noiseless, and C G {2, 3, 5}. Figure 8 plots the MMSE decoding error 
as a function of M and C. The figure also shows how IHT and GPSR perform, in order to 
evaluate whether they are more robust than the Bayesian approach of CS-BP. We did not 
simulate CoSaMP and l\ decoding, since their MMSE performance is comparable to that of 
IHT and GPSR. As the number of mixture components C increases, the MMSE provided 
by CS-BP increases. However, even for C = 3 the sparsity rate effectively doubles from S to 
25, and an increase in the required number of measurements M is expected. Interestingly, 
the greedy IHT method also degrades significantly, perhaps because it implicitly makes an 
assumption regarding the number of large mixture components. GPSR, on the other hand, 
degrades more gracefully. 
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Figure 6: Run-time in seconds as a function of the signal length N using CS-BP, linear program- 
ming (LP) li decoding, GPSR, CoSaMP, and IHT. (S = 0.1, L = 20, M = OAN, u x = 10, cr = 1, 
and noiseless measurements.) 

6 Variations and enhancements 

Supporting arbitrary sparsifying basis ty: Until now, we have assumed that the canon- 
ical sparsifying basis is used, i.e., \& = J. In this itself is sparse. We now ex- 
plain how CS-BP can be modified to support the case where x is sparse in an arbitrary 
basis In the encoder, we multiply the CS-LDPC matrix $ by \1/ T and encode x as 
y = ($\I /T )x = (^^> T )(^>9) = $6 I , where (-) T denotes the transpose operator. In the decoder, 
we use BP to form the approximation 9, and then transform via f to J = *&9. In order to 
construct the modified encoding matrix $\I/ T and later transform 9 to x, extra computation 
is needed; this extra cost is 0(N 2 ) in general. Fortunately, in many practical situations \I/ 
is structured (e.g., Fourier or wavelet bases) and amenable to fast computation. Therefore, 
extending our methods to such bases is feasible. 

Exploiting statistical dependencies: In many signal representations, the coefficients 
are not iid. For example, wavelet representations of natural images often contain correlations 
between magnitudes of parent and child coefficients [2,43]. Consequently, it is possible to 
decode signals from fewer measurements using an algorithm that allocates different distribu- 
tions to different coefficients [46,58]. By modifying the dependencies imposed by the prior 
constraint nodes (Section 4.1), CS-BP decoding supports different signal models. 

Feedback: Feedback from the decoder to the encoder can be used in applications where 
measurements may be lost because of transmissions over faulty channels. In an analogous 
manner to a digital fountain [59], the marginal distributions (6) enable us to identify when 
sufficient information for signal decoding has been received. At that stage, the decoder 
notifies the encoder that decoding is complete, and the stream of measurements is stopped. 

Irregular CS-LDPC matrices: In channel coding, LDPC matrices that have irregular 
row and column weights come closer to the Shannon limit, because a small number of rows 
or columns with large weights require only modest additional computation yet greatly reduce 
the block error rate [38]. In an analogous manner, we expect irregular CS-LDPC matrices 
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Figure 7: MMSE as a function of M using different noise levels a 2 z . The dashed lines show the £2 
norms of x (top) and the small coefficients (bottom). (N = 1000, S = 0.1, L = 20, o\ = 10, and 
a = I.) 

to enable a further reduction in the number of measurements required. 

7 Discussion 

This paper has developed a sparse encoding matrix and belief propagation decoding algo- 
rithm to accelerate CS encoding and decoding under the Bayesian framework. Although 
we focus on decoding approximately sparse signals, CS-BP can be extended to signals that 
are sparse in other bases, is flexible to modifications in the signal model, and can address 
measurement noise. 

Despite the significant benefits, CS-BP is not universal in the sense that the encoding 
matrix and decoding methods must be modified in order to apply our framework to arbitrary 
bases. Nonetheless, the necessary modifications only require multiplication by the sparsifying 
basis \l/ or its transpose \I/ T . 

Our method resembles low density parity check (LDPC) codes [37,38], which use a 
sparse Bernoulli parity check matrix. Although any linear code can be represented as a 
bipartite graph, for LDPC codes the sparsity of the graph accelerates the encoding and 
decoding processes. LDPC codes are celebrated for achieving rates close to the Shannon 
limit. A similar comparison of the MMSE performance of CS-BP with information theoretic 
bounds on CS performance is left for future research. Additionally, although CS-BP is not 
guaranteed to converge, the recent convergence proofs for LDLC codes [36] suggest that 
future work on extensions of CS-BP may also yield convergence proofs. 

In comparison to previous work on Bayesian aspects of CS [25,26], our method is much 
faster, requiring only 0(Nlog 2 (N)) computation. At the same time, CS-BP offers significant 
flexibility, and should not be viewed as merely another fast CS decoding algorithm. However, 
CS-BP relies on the sparsity of CS-LDPC matrices, and future research can consider the 
applicability of such matrices in different applications. 
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Figure 8: MMSE as a function of the number of measurements M and the number of components 
C in the mixture Gaussian signal model. Plots for CS-BP (x), GPSR (circle), and IHT (asterisk) 
appear for C = 2 (dotted), C = 3 (dashed), and C = 5 (solid). The horizontal dashed lines show 
the £2 norms of x (top) and the small coefficients (bottom). (N = 1000, S = 0.1, L = 20, o\ = 10, 
ctq = 1, and noiseless measurements.) 



Outline of proof of Theorem 1: The proof begins with a derivation of probabilistic 
bounds on ||x||2 and ||#||oo- Next, we review a result by Wang et al. [49, Theorem 1]. The 
proof is completed by combining the bounds with the result by Wang et al. 



Upper bound on \\x\\\: Consider 
has a mixture distribution 



I r*> I I 2 
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where the random variable (RV) Xj 



X 2 af w.p. S 
X 2 al w.p 
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Recall the moment generating function (MGF), Mx{t) 

(1-2*)" 



E[e tx }. The MGF of a Chi-squared 



RV satisfies M x 2 (t) 



For the mixture RV Xf, 
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2ta\ 



+ 



1 - S 
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Additionally, because the Xj are iid, Miuip^) = M x ?{t) 



N 



Invoking the Chernoff bound, 
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we have 



Pr (\\x\\l < SNaf) < e 



-tSNaf 



+ 



1-S 



y/1 - 2ta 2 y/l - 2ta t 



N 



for t < 0. We aim to show that Pr (||x||| < SNaf) decays faster than N 7 as N is increased. 
To do so, let t = — where a > 0. It suffices to prove that there exists some a for which 



A (a) = e 



aS 



S 



+ 



1-S 



l + 2a 



< i. 



Let /2(a) = v / 1 ^ 2 q. anc ^ A^) = eQ? - ^ is easn y seen via Taylor series that /2(a) = 1 — a + 
0(a 2 ) and /3(a) = 1 + a + 0(a 2 ), and so 



A (a) 



5 (1 _ Q + (a 2 )) + (1 - S) (l - a (^j 2 + O (a 2 (^j ^ j 



[l + aS + O^S 2 )] 



l-a[ S+(l 



0(a 2 



Because of the negative term — a(l — S) < 0, which dominates the higher order term 

0(a 2 ) for small a, there exists a > 0, which is independent of N, for which /1(a) < 1. Using 
this a, the Chernoff bound provides an upper bound on Pr(||a;||| < SNaf) that decays 
exponentially with N. In summary, 

Pr (\\x\\l < SNaf) = o{N~ r ). (8) 

Lower bound on In a similar manner, MGF's and the Chernoff bound can be 

used to offer a probabilistic bound on the number of large Gaussian mixture components 



N 



(9) 



j=i 



Taking into account the limited number of large components and the expected squared £2 
norm, = N[Saf + (1 - S)a 2 }, we have 



Pr {\\x\\l > N[2Sa\ + (1 - S)a 2 ]) = o(N^). 



(10) 



We omit the (similar) details for brevity. 

Bound on ||:c||oo : The upper bound on ||x||oo is obtained by first considering large 
mixture components and then small components. First, we consider the large Gaussian 
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mixture components, and denote Xl = : Q(i) = 1}. 

N 



Pr WxlWk < y/QMSN 1 ^)^ 



£Q(0<|sw) > [/ 4 (V21n(^i + 7))] |5JV (11) 
i=i / 
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1 - 



2^2 ln^iV^) x/2tF 
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4 v /ln(S'iV 1 +T) ' 



where fJa) = -A= f a e u l 2 du is the cumulative distribution function of the standard nor- 
mal distribution, the inequality (11) relies on /4(-) < 1 and the possibility that ^2f =1 Q(i) 
is strictly smaller than fSTV, /5(a) = ^^ e ~ a ^ 2 is the pdf of the standard normal distribu- 
tion, (12) relies on the bound /4(a) > 1 — f 5 (a)/a, and the inequality (13) is motivated by 
(1 — a) 13 > 1 — a (3 for a, (3 > 0. Noting that ln(S'iV 1+7 ) increases with N, for large N we 
have 



Pr H^IU < v / 21n(,SiV 1 +7) ( T 1 



N 



y EQ(i)<~SN)>l- 



AT-7 



(14) 



Now consider the small Gaussian mixture components, and denote xs = : Q(i) = 0}. 
As before, 



Pr (Hxslloo < v / 21n(,S'iV 1 +T) < 7i) > 



fJy/2ln(SN^) 
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(15) 
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where in (15) the number of small mixture components is often less than N. Because <j\ > a , 
for large N we have 



AT-7 



Pr (Halloo < y/2U^N^)<r^ > 1 
Combining (9), (14) and (16), for large N we have 

Pr (Halloo < y/2ln(SW+i)cT^ > 1 - ^ 

Result by Wang et al. [49, Theorem 1]: 



(16) 



(17) 
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Theorem 2 ([49]) Consider x G R that satisfies the condition 

i Pf- < Q- (18) 
\\ x h 

In addition, let V be any set of N vectors {v±, . . . ,Vn} C M. N . Suppose a sparse random 
matrix $ G R MxN satisfies 

E[*d=0,E[& ij ] = l,E[*i j ]=8, 
where - = is the fraction of non-zero entries in $. Let 



M _ / o (^g 2 Miv)) ifsQ*>n{\) 

\o{^\o g (N)) ./ s g 2 <o(i) • ^ 

Then with probability at least 1 — iV~ 7 , the random projections -jg^x and jj&Vi can produce 
an estimate ai for x T Vi satisfying 

|a, - x T Vi\ < e||x||2||ui||2, Vi G {1, . . . , N}. 

Application of Theorem 2 to proof of Theorem 1: Combining (8), (10), and (17), 
the union bound demonstrates that with probability lower bounded by 1 — iV~ 7 we have 
II^Hoo < y/2 IniSW+i)^ and ||a;||| G (NSaj, N[2Saj + (1 - S>g]). 6 When these £ 2 and 
bounds hold, we can apply Theorem 2. 

To apply Theorem 2, we must specify (i) Q (18); (ii) the test vectors (vi)f =1 ; (Hi) the 
matrix sparsity s; and (iv) the e parameter. First, the bounds on ||a;||2 and ||a;||oo indicate 

that ^Jjj^- < Q — ^ 2ln % I N 1 ^ - Second, we choose (vi)f =1 to be the N canonical vectors of the 

identity matrix In, providing x T Vi = Xi. Third, our choice of L offers s = ^ = ^(|]frF7) ■ 
Fourth, we set 

_ fJfi 

" y/N[2Sa* + (l-S)a*]' 
Using these parameters, Theorem 2 demonstrates that all N approximations satisfy 

|a» — Xi\ = \ai — x T Vi\ < e\\x\\2\\vi\\2 < fiai 

with probability lower bounded by 1 — iV -7 . Combining the probability that the £2 and 
bounds hold and the decoding probability offered by Theorem 2, we have 

\\a - x||oo < peri (20) 

with probability lower bounded by 1 — 2iV~ 7 . 

We complete the proof by computing the number of measurements M required (19). 
Because sQ 2 = ^ ln(s ^i+ 7) 21n( g^ 1+7) = % we need 

M = o((l + 2^ 1 )^ log(iV)^ = O (n(1 + 2r ] - 1 )^^- 2S + (1 - S) (^j log(iV) 
measurements. □ 



6 The o(-) terms (8) and (10) demonstrate that there exists some Nq such that for all N > No the upper 
and lower bounds on ||x||2 each hold with probability lower bounded by 1 — jN 1 , resulting in a probability 
lower bounded by 1 — 7V~ 7 via the union bound. Because the expression (2) for the number of measurements 
M is an order term, the case where N < Nq is inconsequential. 
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